Microevolutionary processes analysis in the Lithuanian genome

Differences in the relative fitness of genomic variants are foundational, without these, neither natural selection nor adaption can exist. This research analyzed two microevolutionary forces, mutations, and positive selection, using whole genome sequencing data from Lithuanians across three generations: newborns (generation I), their parents (generation II), 60 years old Lithuanians, and the root ancestors (generation III). The main objective was to determine the frequency of mutations under selection in modern humans and how allele frequencies change across generations. Our results show that going through all the landscapes of the relative fitness on each chromosome, the general relative fitness background pattern remains the same in analysed generations. However, the tendency of relative fitness to decrease, in general, is noted. We hypothesize that the de novo genome variants or genome variants with a very low frequency that formed in the previous generation did not have time to be as affected by natural selection, thus, in the following generation, the force of natural selection acting on them is greater and their cumulative relative fitness also decreases. The strong natural selection pressure on the genetic regions that encode the NEGR1 and PTPN1/PTNP21 genes were also identified, highlighting the evolution of the Lithuanian population’s genome over generations, and possible genomic “deficiencies” for better adaptation.


Results
Identification of positive selection signals. The genome-wide distribution signals for each comparison are summarized in Fig. S1. We detected 17 common candidate regions with signatures of recent selection passing from one generation (LTII) to another in the Lithuanian population (LTI) ( Table 1, Fig. 1). Most recent signals were found when comparing Lithuanians (both generations common regions) to the CEU population (Fig. 1).
Biological pathways for genes near the targets of selection included genes that are involved in immune function (HLA-DRB1, FBXL7, and PLD1), metabolism (PLD1), cellular response to stimuli (TNIK), infectious disease (ADCY8), muscle contraction (ACTN2), and gene expression (PTPN1, ZNF717, and ZNF557). The terms identified using DAVID, of selected genes, are listed in Table S1. No significantly enriched terms were found in GO (FDR < 0.05) enrichment analysis.
A total of 28 strong candidate regions for older signatures of selection were identified in the two generations of the Lithuanian population when using the Tajima's D statistic (Table 2). Those results were compared with Urnikyte et al. 2019 1 , who published the selection results, and analyzed the genotyping data of 424 60-year-old Lithuanian genome-wide high-density SNP genotype data, which could be considered as the third generation for comparison. In total, eight regions were found in all three generations ( Table 2). Between these old selection signatures passing through the generations in Lithuanians were genes related to the efficient digestion of dietary fats, and in chromosome 10, comprising the PNLIP and PNLIPRP3 genes that may probably result from local dietary selection pressures in the Lithuanian population. Other genes were related to olfactory receptors on chromosome 9, OR1L1, and OR1L3, the immune response on chromosome 11, IL18BP, vitamin D-binding (GC), and human skin color (BNC2).
Among the results based on Tajima's D statistics, two significant (FDR < 0.05) Gene Ontology (GO) terms were identified: one BP term, and one molecular function (MF) term (Table S2). The enriched biological process was associated with DNA single-strand break repair, and the molecular function includes damaged DNA binding.
The turnover of relative fitness for whole-genome variants. Having each identified genomic variant frequency, we were able to evaluate the relative fitness values. Composing the values of relative fitness for each variant on a chromosome, the landscapes of relative fitness for each chromosome were formed (Fig. 2).
The analysis of the compared relative fitness landscapes between the second (LTII) and first generation (LTI) revealed 50 genomic regions (Table 3) where the relative fitness was significantly smaller or higher in the first generation than in the second one. Going through all the landscapes of relative fitness on each chromosome, the pattern of general relative fitness background remains the same in both generations. However, the tendency for relative fitness to decrease, in general, is noted (Fig. 2).
The genomic regions where relative fitness differs from the background are distinguished by 134 proteincoding genes. The relative fitness is significantly decreased in genes that are involved in the numerous cellular processes that are initiated by extracellular stimuli that work through G protein-coupled receptors (ARHGEF4), Table 1. Common candidate regions of selection detected with XP-EHH and F ST for the Lithuanian population LTI and LTII. a Number of significant SNPs that were located at the extreme 0.1% of the empirical distribution for XP-EHH, and at least one SNP in the region had an F ST p-value < 0.01, for LTI and LTII generations, separated by slash. *LTI/II-first generation (newborns), and second generations (parents) of the Lithuanian population samples.

Genome coordinates
Genes Population (SNPs a ) www.nature.com/scientificreports/ signaling their intracellular transport (MARCHF11), which may be necessary for the long-term survival of nociceptive and autonomic ganglion neurons (RETREG1), the intrinsic apoptotic signaling pathway in response to oxidative stress (ZNF622), etc. Decreased relative fitness was detected in genes that are components of a heterotrimeric cell cycle checkpoint complex, known as the 9-1-1 complex, which is activated to stop cell cycle progression in response to DNA damage or incomplete DNA replication, also, in the PRLR gene, which may function to modulate the endocrine and autocrine effects of prolactin in normal tissue, and cancer or genes, the variants of which have been associated with retinitis pigmentosa. Increased relative fitness was detected in genes that regulate the expression of several genes involved in pituitary development and hormone expression (POU1F1), the signaling pathway there coding protein PTPN21 regulates a variety of cellular processes including cell growth, differentiation, the mitotic cycle, and oncogenic transformation and insulin regulation (SLC2A4).

Discussion
Our analyses demonstrate that distinct microevolutionary scenarios can generate very similar and realistic biodiversity patterns (e.g., the latitudinal diversity gradient). One of the biggest hits that we saw of selection was found in a ~ 131 kb region in chromosome 6, when comparing Lithuanian groups with CEU and FIN populations, which comprise the HLA-DRB1 and HLA-DRB6 genes, with the main function being to present pathogen-derived antigenic peptides to T lymphocytes. We identified three non-synonymous variants in the HLA-DRB1 gene: rs9270302, NC_000006.11:g.32557479G > A, rs9270303, NC_000006.11:g.32557483 T > C, and rs707953, NC_000006.11:g.32557506 T > C. Lithuanians presented a high frequency (0.79) for the derived A allele at rs9270302, which is found at low frequencies in FIN (0.11), CEU (0.06), and YRI (0.29). The derived allele C at rs707953 also presents a high frequency (0.79) in Lithuanians and is found at intermediate frequencies in FIN (0.46), CEU (0.47), and YRI (0.50). The measured LD for these pairs of SNPs in plink 11 showed complete LD between alleles. The frequencies of the variant rs9270302 were ~ 0.69 in CEU to 0.85 in Lithuanians. Fengxue Yu (2017) found that the variant rs9270303 was strongly associated with hepatitis B virus-associated hepatocellular carcinoma (HBV-HCC), however, its role still needs to be confirmed 12 . However, our findings provide fundamental data that need further study to confirm the roles of these variants. One of these hypotheses could be that those polymorphisms confer specific humoral immunity against common pathogens.
In some genes, we have identified non-synonymous variants. In the COL24A1 gene: rs11161747 and NC_000001.10:g.86591837G > A may participate in regulating type I collagen fibrillogenesis at specific anatomical locations during fetal development 13 , in the BTLA gene: rs9288952, NC_000003.11:g.112185025G > A, with a function to inhibit lymphocytes during the immune response in the PTPRN2 14 gene: rs1130495, NC_000007.13:g.157959911A > G, plays a role in vesicle-mediated secretory processes, and it is required for the accumulation of normal levels of insulin-containing vesicles and the prevention of their degradation, in the OR1L4 gene, rs2215530, NC_000009.11:g.125486968G > A, and odorant receptor, and in the PNLIP gene: rs2915748, NC_000010.10:g.118313265T > C. www.nature.com/scientificreports/ Another point of view of this study's whole-genome analysis of microevolutionary processes was an analysis of the relative fitness turnover between two generations. Relative fitness shows how much fitness on a genotype has been compared to the maximum fitness, and so whether it will increase or decrease. Here, the relative fitness is a function not only of the individual, but also of all the generations in which they have been measured, and the relative fitness will change as the gene variant frequencies in the population change. Concerning the fitness of various sequence changes, not at the same speed as evolution occurs, the microevolution in the generations is an attempt to keep the most positive functional effect of each genomic variant in an ever-morphing landscape 15 . During this study, the aim was to find out how genomic and environmental elements determine the differences in relative fitness landscapes between generations, and in which direction the allele frequency changes from generation to generation in the Lithuanian genome. This study showed that going through all the landscapes of the relative fitness on each chromosome, the general relative fitness background pattern remains the same in both generations. However, the tendency of relative fitness to decrease, in general, is noted. We hypothesize that the de novo genome variants or genome variants with a very low frequency that formed in the previous generation did not have time to be as affected by natural selection, thus, in the following generation, the force of natural selection acting on them is greater and their cumulative relative fitness also decreases. Therefore, during the process of microevolution, the genome variants that are not adaptive enough are pushed out through time. Of course, we cannot claim that genomic variants will certainly be removed. On contrary, considering the effects of spatial variation 12,16 in fitness and the fact that selection over many generations is a multiplicative process 17 , the genomic variant can become adaptive after all.
Surely, the comparison of relative fitness between the generations distinguished some specific genomic regions. Those genomic variants are necessary for the correct cellular signal transfer processes, DNA synthesis, and replication. In summary, the relative fitness decreased in the genes for which a mutation could significantly increase the risk of disrupting an important molecular process. A detailed description of gene functions is presented in Table S3. For example, the genomic variants in ZNF622, PRLR with decreased relative fitness show how important it is to protect an individual's genome and to decrease variant rates in the genome: in the case of a ZNF622 gene, if a mutation would be fixed in the genome, there would be a risk of having an imbalance between the reactive oxygen species and the antioxidant defense system. While it is known that oxidative stress is involved in most of the pathological states and diseases 18 , in the case of a PRLR, a fixed and potentially pathogenic genomic variant could disturb the modulation of the endocrine and autocrine effects of prolactin in normal tissue and  www.nature.com/scientificreports/ cancer 19 , in the cases of RP3 and RP1, it would disturb the structure or function of a protein that localizes to the outer segments of rod photoreceptors, and that is essential for their viability, mutations in this gene cause autosomal dominant retinitis pigmentosa. However, there was also an increase in the relative fitness detected in the genomic region, with the TTC8 gene, whose mutations are also associated with retinitis pigmentosa. Therefore, this confirms what we have mentioned earlier, that in the general population, through microevolution, a cumulative relative fitness of genomic variants varies enough to maintain relative fitness equilibrium. According to the data analysis results, regardless of the whole-genome analysis method-selection pressure analysis based on SNPs or relative fitness analysis on each identified genomic variant, a few genomic regions where NEGR1 and PTPN1/PTNP21 genes are placed, coincided. NEGR1 acts on the positive regulation of neuron projection development, and PTPN1/PTNP21 codes PTPs that are known to be signaling molecules that regulate a variety of cellular processes, including cell growth, differentiation, mitotic cycle, and oncogenic transformation. The strong pressure of natural selection on these regions highlights the development of the genome of the Lithuanian population over generations, and possible genomic "deficiencies" for better adaptability. Since the relative fitness in the overlapping regions is not unambiguous-in the genome region where the NEGR1 gene was identified, the relative fitness decreased, and in the case of PTNP21, it increased, this led to the conclusion www.nature.com/scientificreports/ www.nature.com/scientificreports/ that due to the reproducibility and complementarity of the results, both of the analysis methods used in this study are suitable for monitoring microevolutionary processes. There are some limitations to this study. Because of the hypothesis-driven nature of this study, the sample size is relatively small due to economical limitations. In addition, more generations need to be included, which is impossible due to the human species. Despite all limitations, we have identified the candidate regions for selection in different Lithuanian generations, and the adaptive alleles that need to be validated.
In summary, in this study, we have shown that current macroevolutionary models may fail to distinguish between different microevolutionary scenarios. Therefore, establishing causal relationships between ecological factors and macroevolutionary rates or patterns requires rigorous evaluations. Future studies that incorporate microevolutionary processes into the current modeling approaches are needed.

Materials and methods
Sampling and DNA sequencing. We applied the SNP data of 25 trios from Lithuania (25 newborns, 25 mothers, and 25 fathers) obtained by WGS. Inclusion criteria, DNA extraction, WGS data processing were described previously 20 . All participants and their LAR/ parents provided informed consent. All experiments were performed in accordance with the Declaration of Helsinki, and all research methods were carried out in accordance with appropriate regulations and guidelines.
Positive selection analysis. To detect recent signals of positive selection, our original genome sequencing data were merged with the data downloaded from the 1000 Genomes Project Phase3 dataset (gs://genomicspublic-data/1000-genomes-phase-3, access in 2022) 21 . Data merging was performed with bcftools 22 merge tool. SNP with > 20% missing data (max-missing) and SNPs with minor allele frequency (MAF) < 0.01 (minor allele frequency) were excluded. After merging we were left with 1,443,372 common SNPs. Haplotypes for the analysis were constructed with SHAPEIT2 23 . The signatures of recent or ongoing positive selection were investigated using the locus fixation index (F ST ) 24 and the cross-population extended haplotype homozygosity (XP-EHH) 25 27 . For each comparison, an XP-EHH per SNP was obtained, and XP-EHH values of > 2 were considered as being indicative of selection. The SNPs located in the top 0.1% of the XP-EHH empirical distribution were considered as being significant ones. Significant regions were formed by combining significant SNPs that were less than 200 kb apart. We were interested only in those signals detected in the Lithuanian population samples. In each comparison, we considered as the top candidates for recent selection those genomic regions presenting at least two SNPs over the top 0.1% XP-EHH empirical values, and a minimum of one SNP with an F ST rank score p-value of < 0.01.
The older signals of selection were inferred through Tajima's D statistic 28 , and a calculation with the PopGenome 29 package implemented in R v. 4.3.0 considering 100 kb sliding-window size and moving step of 10 kb 29 . Negative Tajima´s D values were identified considering the ranc of the score in the genomic distribution. For further analysis values with empirical p-value < 0.01 were used. P-values of all statistics were calculated using the rank of a score in the genomic distribution as described in Pybus M. et al. 2014 30 . The regions under selection were annotated with ANNOVAR 31 using GRCh37 (hg19), dbSNP151 32 , RefSeqGene, and CADD (Combined Annotation Dependent Depletion), version 1.347 33 . The enrichment of biological processes in selected genes was tested using DAVID (Database for Annotation, Visualization, and Integrated Discovery) 34 and Reactome v.3.7 35 . Linkage disequilibrium between SNPs were measured using plink v.1.07 the command -ld. Manhattan plots and a venn diagram were created with R v. 4.3.0.
Structure of the relative fitness analysis. For relative fitness analysis, three groups of the general population without any additional health issues were analyzed. The third group consisted of the general European population (CEU, FIN, and YRI) for which data were derived from the 1000 Genomes Project Phase3 dataset 21 . This group was used as a reference generation no. 3 in this study (RIII). The second group was formed of adult individuals of Lithuanian origin (LTII) 20 . The first group of subjects are full-term healthy newborns from the general Lithuanian population, born in 2019-2020 (LTI).
Given the abundance of the identified variants for each group, all variants were grouped according to the genomic coordinates on the chromosomes. The calculation of relative fitness values was performed for the second and third generations in the study, comparing the frequency of each identified variant with the frequency of genomic variants in the "reference" first generation, regardless of its mechanism of formation. If the genomic variant that was identified in the second or third generation was not found in the "reference" first generation, then its frequency in the "reference" generation was considered to be the frequency of a single de novo mutation (1 × 10 −8 ). The frequency of the genomic variant in the next generation is where q 1 2 is the frequency of the genomic variant in the second or third generation, q 0 2 is the genomic variant frequency in the "reference" first generation, and S is the strength of natural selection.
Additionally, from the genome sequencing data, we know the frequency of the genomic variants, and the strength of the natural selection that occurs through the generations is defined as follows: